knitr::include_graphics("hero-img-3980048463.png")
This is a simple project that I undertook last year (based on a DataCamp project). It introduces simple data analysis and machine learning concepts using a data set from one of my favorite video-game franchises. The key objectives of the project are the following:
To develop foundation skills in data cleaning, analysis and visualization.
To familiarize myself with the R language.
To develop simple machine learning models utilizing the Random Forest and Decision Tree algorithms in order to predict whether Pokémon are legendary or not.
Hyperlink to data set: https://www.kaggle.com/datasets/rounakbanik/pokemon
Hyperlink to banner picture: https://pokemonletsgo.pokemon.com/en-gb/how-to-play/
library(datasets)
library("tidyverse")
library(ggplot2)
library(rgl)
library(plotly)
library(dplyr)
library(gridExtra)
library(RColorBrewer)
library(ggrepel)
library(caret)
library(e1071)
library(randomForest)
library(tree)
library(pROC)
df <- read.csv("pokemon.csv")
Here I’m selecting the metrics I want to analyse, together with those the machine learning models will use to predict whether a Pokémon is legendary or not. I’m leaving out metrics such as the following: capture_rate, abilities, experience growth, base happiness, percentage growth and type 2.
df <- df %>%
select(name, type1, attack, defense, height_m, sp_attack, sp_defense, speed, weight_kg, generation, is_legendary, hp)
head(df)
A generation in Pokémon simply refers to the generation of games in which a particular Pokémon was introduced (e.g. Pikachu was introduced in Generation I, whilst Dialga was introduced in Generation IV). The generation with the most legendary Pokémon is the seventh, which has 17.
pokemon_by_gen <- df %>%
group_by(generation) %>%
summarize(Count = n())
colors <- c('rgba(93, 164, 214, 0.8)', 'rgba(255, 144, 14, 0.8)', 'rgba(44, 160, 101, 0.8)',
'rgba(255, 65, 54, 0.8)', 'rgba(207, 114, 255, 0.8)', 'rgba(127, 96, 0, 0.8)',
'rgba(0, 128, 128, 0.8)') # seventh color so each of the seven generations gets its own bar color
plot_ly(pokemon_by_gen, x = ~generation, y = ~Count,
type = 'bar',
marker = list(color = colors, opacity = 0.8),
text = ~paste('Generation:', generation, 'Count:', Count),
hoverinfo = 'text',
textposition = 'none') %>%
layout(title = list(text = 'Pokémon per Generation', font = list(size = 24)),
xaxis = list(title = 'Generation', titlefont = list(size = 18), tickfont = list(size = 14)),
yaxis = list(title = 'Count', titlefont = list(size = 18), tickfont = list(size = 14)),
plot_bgcolor = 'rgba(245, 246, 249, 1)',
paper_bgcolor = 'rgba(245, 246, 249, 1)',
margin = list(l = 50, r = 50, b = 100, t = 100, pad = 4))
pokemon_by_gen_legendary <- df %>%
filter(is_legendary == 1) %>%
group_by(generation) %>%
summarize(Count = n())
plot_ly(pokemon_by_gen_legendary, x = ~generation, y = ~Count,
type = 'bar',
marker = list(color = colors, opacity = 0.8),
text = ~paste('Generation:', generation, 'Count:', Count),
hoverinfo = 'text',
textposition = 'none') %>%
layout(title = list(text = 'Legendary Pokémon per Generation', font = list(size = 24)),
xaxis = list(title = 'Generation', titlefont = list(size = 18), tickfont = list(size = 14)),
yaxis = list(title = 'Count', titlefont = list(size = 18), tickfont = list(size = 14)),
plot_bgcolor = 'rgba(245, 246, 249, 1)',
paper_bgcolor = 'rgba(245, 246, 249, 1)',
margin = list(l = 50, r = 50, b = 100, t = 100, pad = 4))
As shown by the graph below, the most common primary type among legendary Pokémon is psychic. As of Generation VII, there are no legendary Pokémon whose primary type is poison or fighting.
pokemon_by_type <- df %>%
group_by(type1) %>%
summarize(Count = n())
plot_ly(pokemon_by_type, x = ~type1, y = ~Count,
type = 'bar',
marker = list(color = ~Count, colorscale = 'Viridis', opacity = 0.8),
text = ~paste('Type:', type1, '<br>Count:', Count),
hoverinfo = 'text',
textposition = 'none') %>%
layout(title = list(text = 'Pokémon by Type', font = list(size = 24)),
xaxis = list(title = 'Type', titlefont = list(size = 18), tickfont = list(size = 14)),
yaxis = list(title = 'Count', titlefont = list(size = 18), tickfont = list(size = 14)),
plot_bgcolor = 'rgba(245, 246, 249, 1)',
paper_bgcolor = 'rgba(245, 246, 249, 1)',
margin = list(l = 50, r = 50, b = 100, t = 100, pad = 4))
pokemon_by_type <- df %>%
filter(is_legendary == 1) %>%
group_by(type1) %>%
summarize(Count = n())
# Plot the legendary-only counts with the same styling as the previous chart
plot_ly(pokemon_by_type, x = ~type1, y = ~Count,
type = 'bar',
marker = list(color = ~Count, colorscale = 'Viridis', opacity = 0.8),
text = ~paste('Type:', type1, '<br>Count:', Count),
hoverinfo = 'text',
textposition = 'none') %>%
layout(title = list(text = 'Legendary Pokémon by Type', font = list(size = 24)),
xaxis = list(title = 'Type', titlefont = list(size = 18), tickfont = list(size = 14)),
yaxis = list(title = 'Count', titlefont = list(size = 18), tickfont = list(size = 14)),
plot_bgcolor = 'rgba(245, 246, 249, 1)',
paper_bgcolor = 'rgba(245, 246, 249, 1)',
margin = list(l = 50, r = 50, b = 100, t = 100, pad = 4))
num_legendary_pokemon <- df %>%
filter(is_legendary == 1) %>%
nrow()
legendary_pokemon <- df %>%
filter(is_legendary == 1)
pokemon_type <- df %>%
group_by(is_legendary) %>%
summarize(count = n())
plot_ly(pokemon_type, labels = ~factor(is_legendary), values = ~count, type = 'pie',
marker = list(colors = c('purple', 'yellow')),
textinfo = 'label+value+percent') %>%
layout(title = '')
The graph below explores the relationship between height (m) and weight (kg). Legendary Pokémon are generally much heavier and taller than their regular counterparts, although a few regular Pokémon are exceptions, including Onix, Steelix and Wailord.
top_5_heaviest <- df %>%
top_n(5, weight_kg)
top_5_tallest <- df %>%
top_n(5, height_m)
combined_top_5 <- rbind(top_5_heaviest, top_5_tallest)
ggplot(df, aes(x = height_m, y = weight_kg, color = factor(is_legendary))) +
geom_point() +
geom_text_repel(data = combined_top_5, aes(label = name), size = 3, nudge_x = 0.2, nudge_y = 0.2) +
labs(title = "", x = "Height (meters)", y = "Weight (kilograms)") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
The graphs below illustrate the relationship between Speed and various other attributes (Attack, Defense, Height, Weight, Special Attack, Special Defense) for Regular and Legendary Pokémon. Each graph consists of a scatter plot where the x-axis represents the attribute and the y-axis represents Speed. The points are color-coded to distinguish between Regular and Legendary Pokémon.
The general trends include the following:
Legendary Pokémon exhibit a strong positive correlation between speed and both attack and special attack, which suggests that Legendary Pokémon with higher attack or special attack values tend to have correspondingly higher speed.
A weak negative correlation is evident between speed and both defense and special defense for both regular and legendary Pokémon, indicating that those with higher defense stats generally have lower speed.
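One way to put numbers on trends like these is to compute the Pearson correlation between speed and each stat separately for regular and legendary Pokémon. The sketch below is only an illustration and was not part of the original analysis: `toy` is a small made-up data frame that borrows column names from `df`, and `cor_by_group` is a hypothetical helper.

```r
# Made-up values for illustration only; pass `df` instead to use the real data.
toy <- data.frame(
  is_legendary = c(0, 0, 0, 0, 1, 1, 1, 1),
  attack       = c(45, 60, 80, 100, 90, 110, 130, 150),
  defense      = c(50, 40, 95, 70, 120, 100, 90, 80),
  speed        = c(40, 70, 45, 80, 85, 95, 110, 130)
)

# Hypothetical helper: Pearson correlation of speed with one stat, per group.
cor_by_group <- function(data, stat) {
  sapply(split(data, data$is_legendary), function(g) {
    cor(g[[stat]], g$speed, use = "complete.obs")
  })
}

speed_attack_cor  <- cor_by_group(toy, "attack")
speed_defense_cor <- cor_by_group(toy, "defense")
print(round(speed_attack_cor, 2))
print(round(speed_defense_cor, 2))
```

The result is a named vector with one correlation per group ("0" for regular, "1" for legendary), which makes it easy to compare the two groups side by side.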
speed_vs_attack <- ggplot(df, aes(x = attack, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Attack", x = "Attack", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
speed_vs_defense <- ggplot(df, aes(x = defense, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Defense", x = "Defense", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
speed_vs_height <- ggplot(df, aes(x = height_m, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Height", x = "Height", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
speed_vs_weight <- ggplot(df, aes(x = weight_kg, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Weight", x = "Weight", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
speed_vs_spattack <- ggplot(df, aes(x = sp_attack, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Special Attack", x = "Special Attack", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
speed_vs_spdefence <- ggplot(df, aes(x = sp_defense, y = speed, color = factor(is_legendary))) +
geom_point() +
labs(title = "Speed vs. Special Defence", x = "Special Defence", y = "Speed") +
scale_color_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary"))
grid.arrange(speed_vs_attack, speed_vs_defense, speed_vs_height, speed_vs_weight, speed_vs_spattack, speed_vs_spdefence, ncol = 2)
The following box plots illustrate the distribution of various attributes for both regular and legendary Pokémon. Each plot consists of two boxes, one for Regular Pokémon (red) and one for Legendary Pokémon (blue). The boxes represent the interquartile range (IQR), while the lines extending from the boxes indicate the range of the data excluding outliers. The dots represent outliers, i.e. individual data points that fall beyond those lines.
Overall, the plots suggest that Legendary Pokémon tend to have higher values for most attributes compared to Regular Pokémon. This is particularly evident for Attack, Special Attack, and Speed, where the median and upper quartile of Legendary Pokémon are significantly higher.
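The quantities a box plot draws can also be computed directly. The sketch below is a minimal illustration on made-up numbers (the `toy` data frame is not the real data set): it computes the quartiles and IQR per group and flags points lying more than 1.5 × IQR beyond the quartiles, which is the default rule `geom_boxplot()` uses to draw outlier dots.

```r
# Made-up special attack values for illustration only.
toy <- data.frame(
  is_legendary = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  sp_attack    = c(30, 45, 60, 75, 90, 95, 110, 120, 135, 150)
)

groups <- split(toy$sp_attack, toy$is_legendary)

# Quartiles per group: rows are the 25%, 50% and 75% points, columns the groups
quartiles <- sapply(groups, quantile, probs = c(0.25, 0.5, 0.75))
iqr <- sapply(groups, IQR)

# Values more than 1.5 * IQR outside the box would be drawn as dots
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)
}
outlier_counts <- sapply(groups, function(x) sum(is_outlier(x)))

print(quartiles)
print(iqr)
print(outlier_counts)
```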
attack <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = attack, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "Attack") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
defence <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = defense, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "Defense") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
sp_attack <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = sp_attack, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "Special Attack") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
sp_defence <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = sp_defense, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "Special Defence") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
speeed <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = speed, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "Speed") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
hpp <- ggplot(na.omit(df), aes(x = factor(is_legendary), y = hp, fill = factor(is_legendary))) +
geom_boxplot() +
labs(title = "", x = "Legendary Status", y = "HP") +
scale_x_discrete(labels = c("Regular", "Legendary")) +
scale_fill_manual(values = c("1" = "blue", "0" = "red"), labels = c("Regular", "Legendary")) +
theme_bw()
grid.arrange(attack, defence, sp_attack, sp_defence, speeed, hpp, ncol = 2)
Training Data: 534 Pokémon
Testing Data: 267 Pokémon
# Seed the random number generator before sampling so the split is reproducible
set.seed(200)
training_data <- sample(1:nrow(df), 2 * nrow(df) / 3)
testing_data <- setdiff(1:nrow(df), training_data)
legendary_test <- df$is_legendary[testing_data]
“Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity” (https://en.wikipedia.org/wiki/Decision_tree_learning).
The decision tree for predicting whether Pokémon are legendary begins at the root with the special attack attribute. It checks whether the Pokémon’s special attack stat is greater or less than 71.5. If it’s less than 71.5, the Pokémon is automatically deemed not to be legendary. If it’s greater, the tree branches off into other attributes, including defense and speed. Special attack is the most important metric: because it forms the root of the tree, it’s the first gate that determines whether a Pokémon should be evaluated further. Defense and speed are also important metrics for how this model determines if a Pokémon is legendary.
Ultimately, this systematic breakdown navigates through various metrics, with each one acting as a gate leading to the Pokémon’s classification. Each decision point in the tree filters the Pokémon through key criteria, with only those that meet all the thresholds being deemed legendary.
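The gate-by-gate logic described above can be sketched as a plain function. This is a hand-written illustration, not the fitted tree: only the 71.5 special attack cutoff comes from the text, while `classify_sketch`, `defense_cut`, `speed_cut` and the order of the later gates are hypothetical placeholders.

```r
# Hand-written sketch of a gate cascade; NOT the fitted tree from this post.
# Only the 71.5 special attack cutoff comes from the text above;
# `defense_cut` and `speed_cut` are hypothetical placeholder thresholds.
classify_sketch <- function(sp_attack, defense, speed,
                            defense_cut = 100, speed_cut = 90) {
  # Root gate: low special attack ends the evaluation immediately
  if (sp_attack < 71.5) {
    return("regular")
  }
  # Later gates: every remaining threshold must also be met
  if (defense >= defense_cut && speed >= speed_cut) {
    return("legendary")
  }
  "regular"
}

print(classify_sketch(sp_attack = 60,  defense = 150, speed = 120))  # stopped at the root gate
print(classify_sketch(sp_attack = 120, defense = 110, speed = 95))   # passes every gate
print(classify_sketch(sp_attack = 120, defense = 50,  speed = 95))   # fails a later gate
```

The real thresholds and branch order can be read from the plotted tree or by printing the fitted `pokemon_tree` object.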
Accuracy of Decision Tree Model: 0.895
set.seed(200)
df$is_legendary <- as.factor(df$is_legendary)
pokemon_tree <- tree(is_legendary ~ ., data = df[training_data, ], na.action = na.omit)
Warning: NAs introduced by coercion
colors <- brewer.pal(3, "Set3")
plot(pokemon_tree, col = colors, lwd = 2, main = "Decision Tree for Legendary Pokémon Prediction")
text(pokemon_tree, pretty = 0, col = "blue", cex = 0.45)
grid()
predictions <- predict(pokemon_tree, newdata = df[testing_data, ], type = "class")
Warning: NAs introduced by coercion
confusion_matrix <- table(predictions, legendary_test)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8951311
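Because legendary Pokémon are a small minority of the data, accuracy on its own can look high even when few legendaries are detected. The sketch below shows how sensitivity and specificity could be read off a confusion matrix built the same way as `confusion_matrix` above (predictions in rows, actual labels in columns). The `actual` and `predicted` vectors are made up for illustration and are not results from this project.

```r
# Made-up labels for illustration only: 8 regular (0) and 2 legendary (1)
actual    <- factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the true labels
cm <- table(predicted, actual)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of legendaries that were found
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of regulars correctly left alone

cat("Accuracy:", accuracy, "\n")
cat("Sensitivity:", sensitivity, "\n")
cat("Specificity:", specificity, "\n")
```

In this toy case the accuracy is 0.8 even though only half of the legendary Pokémon are identified, which is why it helps to report all three numbers.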
“Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees” (https://en.wikipedia.org/wiki/Random_forest).
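The "class selected by most trees" idea can be illustrated without fitting anything. In the sketch below, `tree_votes` is a made-up matrix in which each row holds one tree's predictions (0 = regular, 1 = legendary) for four Pokémon, and `majority_vote` is a hypothetical helper; a real forest such as the one fitted below does this across all of its trees.

```r
# Made-up votes for illustration: 5 trees (rows) x 4 Pokémon (columns)
tree_votes <- rbind(
  tree1 = c(0, 1, 0, 1),
  tree2 = c(0, 1, 1, 0),
  tree3 = c(0, 0, 1, 1),
  tree4 = c(1, 1, 0, 0),
  tree5 = c(0, 1, 1, 0)
)

# Hypothetical helper: predict 1 when more than half of the trees vote 1.
# An odd number of trees avoids ties in this two-class setting.
majority_vote <- function(votes) {
  as.integer(colMeans(votes) > 0.5)
}

forest_prediction <- majority_vote(tree_votes)
print(forest_prediction)
```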
df$is_legendary <- as.factor(df$is_legendary)
pokemon_rf <- randomForest(is_legendary ~ ., data = df[training_data, ], importance = TRUE, na.action = na.omit, type = "classification")
print(pokemon_rf)
Call:
randomForest(formula = is_legendary ~ ., data = df[training_data, ], importance = TRUE, type = "classification", na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 6.32%
Confusion matrix:
0 1 class.error
0 477 4 0.008316008
1 29 12 0.707317073
The confusion matrix (rows are the actual classes, columns are the predicted classes) indicates that:
477 regular Pokémon (class 0) were correctly classified as class 0.
4 regular Pokémon were incorrectly classified as class 1 (false positives).
29 legendary Pokémon (class 1) were incorrectly classified as class 0 (false negatives).
12 legendary Pokémon were correctly classified as class 1.
The class error rates for each class are also provided:
Class 0: 0.008316008 (approximately 0.83%)
Class 1: 0.707317073 (approximately 70.73%)
Accuracy = (True Positives + True Negatives) / (Total Instances)
Accuracy = (12 + 477) / 522 ≈ 0.9368
Accuracy of Random Forest Model: 0.937
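The arithmetic can be rechecked by rebuilding the printed confusion matrix by hand. The sketch below copies the four counts from the `print(pokemon_rf)` output above; the object names `cm`, `oob_accuracy` and `class_error` are introduced here for illustration only.

```r
# Counts copied from the printed randomForest confusion matrix
# (rows = actual class, columns = predicted class)
cm <- matrix(c(477,  4,
                29, 12),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall accuracy: correctly classified cases over all cases
oob_accuracy <- sum(diag(cm)) / sum(cm)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(cm) / rowSums(cm)

print(round(oob_accuracy, 4))
print(round(1 - oob_accuracy, 4))
print(round(class_error, 4))
```

The overall error rate obtained this way agrees with the 6.32% out-of-bag estimate in the model output, and the per-class values match the `class.error` column.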
Mean Decrease Accuracy: measures how much the model’s out-of-bag accuracy decreases when the values of a variable are randomly permuted.
Mean Decrease Gini: measures how important a variable is in a random forest model by quantifying how much splits on it contribute to the homogeneity of the model’s nodes.
As shown by the output below, the three most crucial metrics used by the Random Forest model to predict whether a Pokémon is legendary are special attack, weight and speed.
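To make "homogeneity of the nodes" concrete, the sketch below computes the Gini impurity of a node and the decrease produced by one split. It is a minimal illustration on made-up labels, not output from the fitted forest, and `gini_impurity` is a hypothetical helper. Mean Decrease Gini aggregates decreases of this kind over every split that uses a given variable across the trees.

```r
# Gini impurity: 0 for a pure node, 0.5 for an evenly mixed two-class node
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Made-up node for illustration: 6 regular (0) and 4 legendary (1)
parent <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)

# A hypothetical split sends these cases to the left and right children
left  <- c(0, 0, 0, 0, 0, 1)
right <- c(0, 1, 1, 1)

# Impurity of the children, weighted by how many cases each receives
weighted_children <- (length(left) * gini_impurity(left) +
                      length(right) * gini_impurity(right)) / length(parent)

gini_decrease <- gini_impurity(parent) - weighted_children
print(gini_impurity(parent))
print(gini_decrease)
```

A split that separates the classes more cleanly leaves purer children and therefore produces a larger decrease.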
varImpPlot(pokemon_rf)
knitr::include_graphics("legendary-pokemon-55326671.jpg")
In conclusion, as shown by the ROC curve graph below, the Random Forest model (blue) is clearly more accurate than the Decision Tree model (red). The closer the curve is to the top-left corner, the better the model separates the two classes. The Random Forest model achieved higher sensitivity and specificity than the Decision Tree model, which indicates that it is better at correctly classifying both legendary and regular Pokémon.
The higher accuracy is due to the nature of the Random Forest algorithm, which combines multiple trees to produce more reliable predictions; the Decision Tree algorithm is much simpler and lacks the robustness of Random Forest. Comparing two different models highlighted the importance of model selection and the impact of algorithm complexity on predictive performance.
The most crucial metrics across both models in determining Legendary Pokémon are the following:
Special Attack
Weight (KG)
Speed
Overall, this simple project explored and introduced the fundamental concepts of data analysis, visualization and machine learning with R. It acted as a gateway offering practical, hands-on experience in applying these techniques. The skills and insights gained here lay a solid foundation for more advanced explorations in the exciting fields of data analysis and data science.
# Predict probabilities for both models
rf_prob <- predict(pokemon_rf, df[testing_data, ], type = "prob")[, 2]
tree_prob <- predict(pokemon_tree, df[testing_data, ], type = "vector")[, 2]
Warning: NAs introduced by coercion
# Generate ROC curves
rf_roc <- roc(df$is_legendary[testing_data], rf_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
tree_roc <- roc(df$is_legendary[testing_data], tree_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Plot ROC curves
plot(rf_roc, col = "blue", main = "ROC Curves for Random Forest and Decision Tree")
lines(tree_roc, col = "red")
legend("bottomright", legend = c("Random Forest", "Decision Tree"), col = c("blue", "red"), lwd = 2)
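A curve that sits closer to the top-left corner also has a larger area under it (AUC), which gives a single number for comparing models. With the pROC objects above, `auc(rf_roc)` and `auc(tree_roc)` would report it directly. The base R sketch below shows what that number measures, using made-up labels and scores rather than this project's predictions; `auc_rank` is a hypothetical helper based on the rank (Mann-Whitney) formulation.

```r
# AUC as the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(actual, score) {
  pos <- score[actual == 1]
  neg <- score[actual == 0]
  r <- rank(c(pos, neg))  # average ranks handle tied scores
  pos_rank_sum <- sum(r[seq_along(pos)])
  (pos_rank_sum - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Made-up example: 4 regular (0) and 2 legendary (1) Pokémon
actual  <- c(0, 0, 0, 0, 1, 1)
score_a <- c(0.10, 0.20, 0.30, 0.40, 0.80, 0.90)  # separates the classes perfectly
score_b <- c(0.10, 0.20, 0.85, 0.40, 0.80, 0.30)  # mixes the classes up

auc_a <- auc_rank(actual, score_a)
auc_b <- auc_rank(actual, score_b)
print(auc_a)
print(auc_b)
```

An AUC of 1 means every legendary Pokémon is scored above every regular one, while 0.5 is what random scores would achieve on average.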